On Some Pitfalls in Automatic Evaluation and Significance Testing for MT
Authors
Abstract
We investigate some pitfalls regarding the discriminatory power of MT evaluation metrics and the accuracy of statistical significance tests. In a discriminative reranking experiment for phrase-based SMT we show that the NIST metric is more sensitive than BLEU or F-score despite their incorporation of aspects of fluency or meaning adequacy into MT evaluation. In an experimental comparison of two statistical significance tests we show that p-values are estimated more conservatively by approximate randomization than by bootstrap tests, thus increasing the likelihood of type-I error for the latter. We point out a pitfall of randomly assessing significance in multiple pairwise comparisons, and conclude with a recommendation to combine NIST with approximate randomization, at more stringent rejection levels than is currently standard.
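The significance tests compared in the abstract can be illustrated concretely. The sketch below is a minimal paired approximate randomization test over per-sentence scores; the function name and the score representation are assumptions for illustration only (real corpus-level BLEU/NIST are computed from aggregated sufficient statistics, not sums of sentence scores). Under the null hypothesis that two systems are interchangeable, each sentence pair's labels can be swapped at random; the p-value estimates how often a random relabeling produces a score difference at least as large as the observed one.

```python
import random

def approximate_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Paired approximate randomization test (illustrative sketch).

    scores_a, scores_b: per-sentence metric scores for two MT systems
    on the same test set. Returns an estimated p-value for the null
    hypothesis that the two systems are interchangeable.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    count = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis, swap each pair's labels
            # with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) >= observed:
            count += 1
    # Add-one smoothing so the estimated p-value is never exactly zero.
    return (count + 1) / (trials + 1)
```

A bootstrap test would instead resample test-set sentences with replacement and count how often the score difference changes sign; the abstract's finding is that the randomization variant yields more conservative p-values, making it the safer choice at stringent rejection levels.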
Similar resources
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are central to Machine Translation (MT) engines, which are developed through frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from the Lexical Similarity set on machine tra...
Analysis of motor fan radiated sound and vibration waveform by automatic pattern recognition technique using “Mahalanobis distance”
In recent years, as the weight of IT equipment has been reduced, demand for motor fans for cooling the interior of electronic equipment has been rising. Sensory testing by inspectors is the mainstream technique for quality inspection of motor fans in the field. This sensory testing requires a lot of experience to accurately diagnose differences in subtle sounds (sound pressures) of the fans, and t...
A Comparison of Website Accessibility Evaluation Methods (Case Study: Websites of the Ministries of the Government of the Islamic Republic of Iran)
Purpose: The present research aims to comparatively study different methods for evaluating the accessibility of websites, and to analyze the results of a case study on the websites of the ministries of the Iranian government, in order to indicate the strengths, weaknesses, and differences in the evaluation findings obtained by applying each website accessibility method. Methodology: In this paper, initially the ...
All in Strings: a Powerful String-based Automatic MT Evaluation Metric with Multiple Granularities
String-based metrics for automatic machine translation (MT) evaluation are widely applied in MT research. Meanwhile, some linguistically motivated metrics have been suggested to improve on string-based metrics in sentence-level evaluation. In this work, we attempt to change the original calculation units (granularities) of string-based metrics to generate new features. We then propose a powerful s...
A Petri-net based modeling tool for analysis and evaluation of computer systems
The Petri net is one of the most popular methods for modeling and evaluating concurrent, event-based systems. Different tools have been created to support modeling and simulation of different Petri net extensions in different applications, each supporting particular extensions and features. In this work a Petri-net based modeling and evaluation tool is presented that not only supports dif...